feat: in-place task restart on recoverable task failure #7093

Open
AdilFayyaz wants to merge 2 commits into v2 from adil/in-place-pod-retry-handler

@AdilFayyaz AdilFayyaz commented Mar 25, 2026

Why are the changes needed?

Previously, when a task pod failed with a recoverable error, the executor relied on the runs service to create a new TaskAction CR for each retry. This external orchestration was broken, so retryable tasks stayed stuck in the non-terminal RetryableFailure phase indefinitely.

What changes were proposed in this pull request?

  • Intercepts PhaseRetryableFailure in TaskActionReconciler.Reconcile() before updating status
  • If retries remain (currentAttempts < maxAttempts): calls p.Abort() to delete the failed pod, increments Status.Attempts, clears PluginState, and overrides the transition to PhaseQueued, so the next reconcile launches a fresh pod under the same TaskAction
  • If retries are exhausted: converts PhaseRetryableFailure → PhasePermanentFailure, making the TaskAction terminal (Failed=True)
  • Pod naming is naturally collision-free: buildGeneratedName encodes the attempt number (-0, -1, ...), so each restart creates a distinct pod

How was this patch tested?

  • Deploy a task with retries: 2 (maxAttempts=3) that always exits non-zero
  • Verify pods [name]-0, [name]-1, [name]-2 are each created and deleted in sequence:
    kubectl get pods -n [namespace] | grep [taskaction-name]
  • Verify Status.Attempts increments on each failure:
    kubectl get taskaction [name] -n [namespace] -o jsonpath='{.status.attempts}'
  • After the third failure, verify the TaskAction is terminal:
    kubectl get taskaction [name] -n [namespace] -o jsonpath='{.status.conditions}'
    Expected: Failed=True, Reason=PermanentFailure

  • Deploy a task with retries: 0 — confirm first failure immediately produces PermanentFailure
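
The terminal-state check in the steps above could be automated with a small helper like the one below. It is only a sketch: it assumes the jsonpath query prints the conditions as a JSON array and that each condition carries type, status, and reason fields in the style of metav1.Condition. The isPermanentlyFailed name and the sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition models only the fields this check needs.
// Assumption: the TaskAction conditions use these JSON field names.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

// isPermanentlyFailed reports whether the conditions include
// Failed=True with Reason=PermanentFailure.
func isPermanentlyFailed(raw []byte) (bool, error) {
	var conds []condition
	if err := json.Unmarshal(raw, &conds); err != nil {
		return false, fmt.Errorf("decode conditions: %w", err)
	}
	for _, c := range conds {
		if c.Type == "Failed" && c.Status == "True" && c.Reason == "PermanentFailure" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical sample of what the kubectl query might print.
	sample := []byte(`[{"type":"Failed","status":"True","reason":"PermanentFailure"}]`)
	failed, err := isPermanentlyFailed(sample)
	fmt.Println(failed, err)
}
```

Piping the kubectl output into a helper like this lets both the retries: 2 and retries: 0 cases be asserted the same way.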

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz self-assigned this Mar 25, 2026
@AdilFayyaz AdilFayyaz added the "added" (Merged changes that add new functionality) and "flyte2" labels Mar 25, 2026
@github-actions github-actions bot mentioned this pull request Mar 25, 2026
3 tasks
@AdilFayyaz AdilFayyaz requested a review from pingsutw March 25, 2026 22:00
@AdilFayyaz AdilFayyaz changed the title from "feat: in-place pod restart on recoverable task failure" to "feat: in-place task restart on recoverable task failure" Mar 25, 2026
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>